 similarity algorithm


Improving ICD-based semantic similarity by accounting for varying degrees of comorbidity

Schneider, Jan Janosch, Adler, Marius, Ammer-Herrmenau, Christoph, König, Alexander Otto, Sax, Ulrich, Hügel, Jonas

arXiv.org Artificial Intelligence

Finding similar patients is a common objective in precision medicine, facilitating treatment outcome assessment and clinical decision support. Choosing widely available patient features and appropriate mathematical methods for similarity calculations is crucial. International Statistical Classification of Diseases and Related Health Problems (ICD) codes are used worldwide to encode diseases and are available for nearly all patients. Aggregated as sets consisting of primary and secondary diagnoses, they can display a degree of comorbidity and reveal comorbidity patterns. It is possible to compute the similarity of patients based on their ICD codes by using semantic similarity algorithms. These algorithms have traditionally been evaluated using a single-term, expert-rated data set. However, real-world patient data often display varying degrees of documented comorbidities that might impair algorithm performance. To account for this, we present a scale term that considers documented comorbidity variance. In this work, we compared the performance of 80 combinations of established algorithms in terms of semantic similarity based on ICD code sets. The sets have been extracted from patients with a C25.X (pancreatic cancer) primary diagnosis and provide a variety of different combinations of ICD codes. Using our scale term, we obtained the best results with a combination of level-based information content, Leacock & Chodorow concept similarity and bipartite graph matching for the set similarities, reaching a correlation of 0.75 with our expert's ground truth. Our results highlight the importance of accounting for comorbidity variance while demonstrating how well current semantic similarity algorithms perform.
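To make the pipeline in this abstract more concrete, the following is a minimal sketch of two of the named building blocks: Leacock & Chodorow concept similarity on a code hierarchy, and a set similarity obtained by bipartite matching. The tiny `PARENT` tree, the normalisation to [0, 1], the division by the larger set size, and all function names are assumptions made for illustration. The level-based information content and the authors' scale term are not reproduced, because the abstract does not define them.

```python
import math
from itertools import permutations

# Toy is-a hierarchy (child -> parent). The codes are ICD-10 identifiers,
# but the tree shape is a simplified assumption, not the real ICD structure.
PARENT = {
    "C25.0": "C25", "C25.1": "C25", "C25": "C15-C26",
    "C15-C26": "II", "II": "ROOT",
    "E11.9": "E11", "E11": "E10-E14", "E10-E14": "IV", "IV": "ROOT",
    "K86.1": "K86", "K86": "K80-K87", "K80-K87": "XI", "XI": "ROOT",
}


def ancestors(code):
    """Return the chain [code, parent, ..., ROOT]."""
    chain = [code]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


# Depth of the taxonomy, counted in nodes from a leaf up to ROOT.
MAX_DEPTH = max(len(ancestors(code)) for code in PARENT)


def path_nodes(a, b):
    """Number of nodes on the shortest path between a and b (1 if a == b)."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node) + 1
    raise ValueError("codes do not share a root")


def lc_similarity(a, b):
    """Leacock & Chodorow similarity, -log(len / (2 * D)), scaled to [0, 1]."""
    scale = 2 * MAX_DEPTH
    return math.log(scale / path_nodes(a, b)) / math.log(scale)


def matching_similarity(set_a, set_b):
    """Best one-to-one matching between two code sets (brute force).

    Unmatched codes in the larger set contribute 0; dividing by the larger
    set size is one possible normalisation, chosen here as an assumption.
    """
    small, large = sorted((list(set_a), list(set_b)), key=len)
    if not small:
        return 0.0
    best = max(
        sum(lc_similarity(x, y) for x, y in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


patient_1 = {"C25.0", "E11.9"}
patient_2 = {"C25.1", "E11.9", "K86.1"}
score = matching_similarity(patient_1, patient_2)
```

The brute-force search over permutations is only practical for small code sets; for larger sets, an assignment solver such as the Hungarian algorithm would be the usual replacement.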


A Comparison of Document Similarity Algorithms

Gahman, Nicholas, Elangovan, Vinayak

arXiv.org Artificial Intelligence

Document similarity is an important part of Natural Language Processing and is most commonly used for plagiarism detection and text summarization. Thus, finding the overall most effective document similarity algorithm could have a major positive impact on the field of Natural Language Processing. This report sets out to examine the numerous document similarity algorithms and determine which ones are the most useful. It approaches this question by grouping the algorithms into three categories: statistical algorithms, neural networks, and corpus/knowledge-based algorithms. The most effective algorithms in each category are also compared in our work using a series of benchmark datasets and evaluations that test each area in which an algorithm could be used.

Introduction. Document similarity analysis is a Natural Language Processing (NLP) task where two or more documents are analyzed to recognize the similarities between these documents. Document similarity is heavily used in text summarization, recommender systems, plagiarism detection, and search engines. Identifying the level of similarity or dissimilarity between two or more documents based on their content is the main objective of document similarity analysis.
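As an illustration of the "statistical algorithms" category mentioned above, here is a minimal sketch of TF-IDF weighting combined with cosine similarity. It is a generic textbook formulation, not necessarily one of the algorithms benchmarked in the report; the smoothed IDF formula, the tokenizer, and the example sentences are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "The cat sat on the mat.",
    "The cat sat on a mat.",
    "Stock prices fell sharply.",
]
vectors = tfidf_vectors(docs)
near = cosine(vectors[0], vectors[1])   # near-duplicate sentences
far = cosine(vectors[0], vectors[2])    # no shared vocabulary
```

Because the third sentence shares no tokens with the first, `far` is exactly zero, which also shows the main limitation of purely lexical methods: paraphrases with different wording are scored as unrelated.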


Artificial Intelligence at eBay - Two Current Use-Cases

#artificialintelligence

Daniel Faggella is Head of Research at Emerj. Called upon by the United Nations, World Bank, INTERPOL, and leading enterprises, Daniel is a globally sought-after expert on the competitive strategy implications of AI for business and government leaders. The company that would become eBay was founded as a sole proprietorship under the name AuctionWeb in September 1995 by Pierre Omidyar. The company changed its name to eBay in September 1997. Today, eBay is a global e-commerce leader in more than 190 markets throughout the world.


User-friendly Comparison of Similarity Algorithms on Wikidata

Ilievski, Filip, Szekely, Pedro, Satyukov, Gleb, Singh, Amandeep

arXiv.org Artificial Intelligence

While the similarity between two concept words has been evaluated and studied for decades, much less attention has been devoted to algorithms that can compute the similarity of nodes in very large knowledge graphs, like Wikidata. To facilitate investigations and head-to-head comparisons of similarity algorithms on Wikidata, we present a user-friendly interface that allows flexible computation of similarity between Qnodes in Wikidata. At present, the similarity interface supports four algorithms, based on: graph embeddings (TransE, ComplEx), text embeddings (BERT), and class-based similarity. We demonstrate the behavior of the algorithms on representative examples about semantically similar, related, and entirely unrelated entity pairs. To support anticipated applications that require efficient similarity computations, like entity linking and recommendation, we also provide a REST API that can compute most similar neighbors for any Qnode in Wikidata.
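One simple way to picture class-based similarity and a "most similar neighbors" lookup is Jaccard overlap between the sets of classes that two entities belong to. The sketch below uses that reading; it is not the paper's algorithm and does not call the paper's REST API. The `CLASSES` table, its labels, and the function names are hypothetical toy data rather than actual Wikidata content.

```python
# Hypothetical entity -> set of (transitive) classes; a stand-in for what
# would be derived from instance-of / subclass-of statements in a real graph.
CLASSES = {
    "house cat": {"mammal", "animal", "pet", "organism"},
    "dog": {"mammal", "animal", "pet", "organism", "working animal"},
    "oak": {"tree", "plant", "organism"},
    "bicycle": {"vehicle", "artifact"},
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def class_similarity(entity_a, entity_b):
    """Similarity of two entities based on their shared classes."""
    return jaccard(CLASSES[entity_a], CLASSES[entity_b])


def most_similar(entity, k=2):
    """Return the k highest-scoring other entities as (name, score) pairs."""
    scored = [
        (other, class_similarity(entity, other))
        for other in CLASSES
        if other != entity
    ]
    # Sort by descending score, breaking ties alphabetically.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


neighbors = most_similar("house cat", k=2)
```

In this toy table, "dog" ranks first for "house cat" (4 shared classes out of 5 in the union), followed by "oak", which shares only the generic class "organism".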


Supervised machine learning techniques for data matching based on similarity metrics

Verschuuren, Pim, Palazzo, Serena, Powell, Tom, Sutton, Steve, Pilgrim, Alfred, Giannelli, Michele Faucci

arXiv.org Machine Learning

Businesses, governmental bodies and NGOs have an ever-increasing amount of data at their disposal from which they try to extract valuable information. Often, this needs to be done not only accurately but also within a short time frame. Clean and consistent data is therefore crucial. Data matching is the field that tries to identify instances in data that refer to the same real-world entity. In this study, machine learning techniques are combined with string similarity functions and applied to the field of data matching. A dataset of invoices from a variety of businesses and organizations was preprocessed with a grouping scheme to reduce pair dimensionality, and a set of similarity functions was used to quantify similarity between invoice pairs. The resulting invoice pair dataset was then used to train and validate a neural network and a boosted decision tree. The performance was compared with a solution from FISCAL Technologies as a benchmark against currently available deduplication solutions. Both the neural network and the boosted decision tree showed equal or better performance.
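The preprocessing described above can be sketched roughly as follows: group records so that only plausible pairs are compared, then turn each pair into a vector of similarity scores that a classifier could be trained on. The invoice fields, the choice to group by amount, and the use of `difflib.SequenceMatcher` as the string similarity function are all assumptions for illustration; the abstract does not specify the study's actual grouping scheme or similarity functions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical invoice records; the field names are assumptions.
invoices = [
    {"id": 1, "supplier": "Acme Ltd", "number": "INV-1001", "amount": 250.00},
    {"id": 2, "supplier": "ACME Limited", "number": "INV1001", "amount": 250.00},
    {"id": 3, "supplier": "Acme Ltd", "number": "INV-2040", "amount": 99.50},
    {"id": 4, "supplier": "Bolt Supplies", "number": "B-77", "amount": 250.00},
]


def candidate_pairs(records, key):
    """Yield pairs only within groups that share the same key value."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    for group in groups.values():
        yield from combinations(group, 2)


def string_sim(a, b):
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_features(a, b):
    """Similarity features for one invoice pair (input to a classifier)."""
    return {
        "supplier_sim": string_sim(a["supplier"], b["supplier"]),
        "number_sim": string_sim(a["number"], b["number"]),
        "amount_equal": float(a["amount"] == b["amount"]),
    }


# Grouping by amount is an illustrative choice, not the study's scheme.
pairs = list(candidate_pairs(invoices, key=lambda r: r["amount"]))
features = {(a["id"], b["id"]): pair_features(a, b) for a, b in pairs}
```

With four invoices there are six possible pairs, but grouping by amount leaves only the three pairs among invoices 1, 2 and 4. The resulting feature dictionaries would then be labeled as duplicate or non-duplicate and passed to a model such as a neural network or a boosted decision tree.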